Acoustic data-driven lexicon learning based on a greedy pronunciation selection framework
Speech recognition systems for irregularly-spelled languages like English
normally require hand-written pronunciations. In this paper, we describe a
system for automatically obtaining pronunciations of words for which
pronunciations are not available, but for which transcribed data exists. Our
method integrates information from the letter sequence and from the acoustic
evidence. The novel aspect we address is how to prune entries from such a
lexicon (since, empirically, lexicons with too many entries tend to hurt
ASR performance). Experiments on
various ASR tasks show that, with the proposed framework, starting with an
initial lexicon of several thousand words, we are able to learn a lexicon which
performs close to a full expert lexicon in terms of WER on test data, and is
better than lexicons built using G2P alone or with a pruning criterion based
on pronunciation probability.
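As a rough illustration of the greedy selection idea, a sketch follows. This is not the paper's actual algorithm: the scoring function and the stopping rule are assumptions, standing in for the acoustic-evidence-based objective the paper optimizes. The sketch repeatedly adds the candidate pronunciation with the largest objective gain and stops when no candidate helps, which is what naturally prunes low-value entries.

```python
def greedy_select(candidates, score, max_prons):
    """Greedily pick pronunciations for one word.

    candidates: list of candidate pronunciation strings
    score: hypothetical callable mapping a selection (list of
        pronunciations) to a likelihood-based objective (assumption)
    max_prons: cap on lexicon entries per word
    """
    selected = []
    while len(selected) < max_prons:
        best, best_gain = None, 0.0
        for p in candidates:
            if p in selected:
                continue
            gain = score(selected + [p]) - score(selected)
            if gain > best_gain:
                best, best_gain = p, gain
        if best is None:  # no candidate improves the objective: stop
            break
        selected.append(best)
    return selected
```

A per-entry penalty inside `score` is one simple way to make the objective prefer compact lexicons.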
Semi-supervised training for automatic speech recognition
State-of-the-art automatic speech recognition (ASR) systems use sequence-level objectives such as Connectionist Temporal Classification (CTC) and Lattice-Free Maximum Mutual Information (LF-MMI) to train neural network-based acoustic models. These methods are known to be most effective on large datasets with hundreds or thousands of hours of data. Outside a few major languages such as English and Mandarin, large amounts of supervised data are difficult to obtain, as is supervised data covering a myriad of channel and environmental conditions. On the other hand, large amounts of unsupervised audio can be obtained fairly easily: broadcast TV, call centers, and YouTube provide enormous quantities of untranscribed speech in many languages and environmental conditions. The goal of this research is to discover how best to leverage the available unsupervised data for training acoustic models for ASR.
In the first part of this thesis, we extend the Maximum Mutual Information (MMI) training to the semi-supervised training scenario. We show that maximizing Negative Conditional Entropy (NCE) over lattices from unsupervised data, along with state-level Minimum Bayes Risk (sMBR) on supervised data, in a multi-task architecture gives word error rate (WER) improvements without needing any confidence-based filtering.
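The NCE term can be illustrated at toy scale: it is the negative entropy of the model's posterior distribution over competing hypotheses, so maximizing it sharpens the model's beliefs on unsupervised data. In practice it is computed over the full lattice with forward-backward rather than by enumerating hypotheses; the function below is purely illustrative and its name is my own.

```python
import numpy as np

def negative_conditional_entropy(hyp_posteriors):
    """Toy NCE objective over a handful of hypothesis posteriors.

    hyp_posteriors: unnormalized posterior weights of competing decoded
    hypotheses (illustrative stand-in for lattice path posteriors).
    Returns -H(p); larger values mean a more confident model.
    """
    p = np.asarray(hyp_posteriors, dtype=float)
    p = p / p.sum()
    return float(np.sum(p * np.log(p + 1e-12)))  # -H(p)
```

A peaked posterior scores higher than a uniform one, which is the direction maximizing NCE pushes the acoustic model on untranscribed audio.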
In the second part of this thesis, we investigate using lattice-based supervision as the numerator graph to incorporate the uncertainties of unsupervised data into the LF-MMI training framework. We explore various aspects of creating the numerator graph, including splitting lattices for minibatch training, applying a tolerance to frame-level alignments, pruning beam sizes, the word LM scale, and the inclusion of pronunciation variants. We show that the WER recovery rate (WRR) of our proposed approach is 5-10% absolute better than that of the baseline that uses the 1-best transcript as supervision, and remains stable in the 40-60% range even on large-scale setups and across multiple languages.
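WER recovery rate measures what fraction of the gap between a seed system (trained on supervised data only) and an oracle system (trained as if all transcripts were known) the semi-supervised system closes. A sketch, assuming this standard definition:

```python
def wer_recovery_rate(wer_seed, wer_semisup, wer_oracle):
    """Fraction of the seed-to-oracle WER gap recovered by
    semi-supervised training. 0.0 = no better than the seed system,
    1.0 = matches the oracle. All arguments in the same units (% WER)."""
    return (wer_seed - wer_semisup) / (wer_seed - wer_oracle)
```

For example, a seed system at 25% WER, a semi-supervised system at 20%, and an oracle at 15% give a WRR of 0.5, i.e. half the gap recovered.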
Finally, we explore transfer learning for the scenario where the unsupervised data is in a mismatched domain. First, we look at the teacher-student learning approach for cases where parallel data is available in the source and target domains. Here, we train a "student" neural network on the target-domain data to mimic a "teacher" neural network on the source-domain data, but using sequence-level posteriors instead of the traditional frame-level posteriors. We show that the proposed approach is very effective at dealing with acoustic domain mismatch in multiple unsupervised domain adaptation scenarios: clean to noisy speech, 8 kHz to 16 kHz speech, and close-talk microphone to distant microphone. Second, we investigate approaches to mitigate language domain mismatch, and show that a matched language model significantly improves WRR. We finally show that our proposed semi-supervised transfer learning approach works effectively even on large-scale unsupervised datasets with 2000 hours of audio in natural and realistic conditions.
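Computing sequence-level posteriors over lattices is involved; the sketch below shows only the traditional frame-level KL objective that the thesis's sequence-level approach replaces. It is a minimal numpy illustration with assumed names, not the thesis's implementation.

```python
import numpy as np

def frame_level_kl(teacher_logits, student_logits):
    """Mean per-frame KL(teacher || student) over output posteriors.

    teacher_logits, student_logits: arrays of shape
    (num_frames, num_classes). Minimizing this trains the student to
    mimic the teacher frame by frame; the sequence-level variant in the
    thesis instead matches posteriors computed over lattice paths.
    """
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    p = softmax(np.asarray(teacher_logits, dtype=float))
    q = softmax(np.asarray(student_logits, dtype=float))
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))
```

In the parallel-data setting, the teacher sees the source-domain version of an utterance and the student sees the target-domain version of the same utterance.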
Voice-preserving Zero-shot Multiple Accent Conversion
Most people who have tried to learn a foreign language have experienced
difficulty understanding or speaking with a native speaker's accent. For
native speakers, understanding or speaking a new accent is likewise
difficult. An accent conversion system that changes a speaker's accent but preserves
that speaker's voice identity, such as timbre and pitch, has the potential for
a range of applications, such as communication, language learning, and
entertainment. Existing accent conversion models tend to change the speaker
identity and accent at the same time. Here, we use adversarial learning to
disentangle accent-dependent features while retaining other acoustic
characteristics. What sets our work apart from existing accent conversion
models is the capability to convert an unseen speaker's utterance to multiple
accents while preserving its original voice identity. Subjective evaluations
show that our model generates audio that sounds closer to the target accent
while still sounding like the original speaker.
Comment: Submitted to IEEE ICASSP 202
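Adversarial disentanglement of this kind is often implemented with a gradient reversal layer (GRL) between the encoder and an accent classifier; whether this paper uses a GRL specifically is an assumption here. A minimal sketch of the layer's behavior:

```python
import numpy as np

def grl_forward(x):
    """Identity in the forward pass: features flow to the accent
    classifier unchanged."""
    return x

def grl_backward(grad, lam=1.0):
    """Backward pass: the classifier's gradient is negated (scaled by
    lam) before reaching the encoder, so the encoder is pushed to
    remove accent information rather than expose it."""
    return -lam * np.asarray(grad, dtype=float)
```

The classifier thus learns to predict accent while the encoder, seeing the reversed gradient, learns accent-invariant features.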
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Large-scale generative models such as GPT and DALL-E have revolutionized
natural language processing and computer vision research. These models not only
generate high fidelity text or image outputs, but are also generalists which
can solve tasks not explicitly taught. In contrast, speech generative models
are still primitive in terms of scale and task generalization. In this paper,
we present Voicebox, the most versatile text-guided generative model for speech
at scale. Voicebox is a non-autoregressive flow-matching model trained to
infill speech given audio context and text, and trained on over 50K hours of
speech that are neither filtered nor enhanced. Similar to GPT, Voicebox can
perform many different tasks through in-context learning, but is more flexible
as it can also condition on future context. Voicebox can be used for mono or
cross-lingual zero-shot text-to-speech synthesis, noise removal, content
editing, style conversion, and diverse sample generation. In particular,
Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both
intelligibility (1.9% word error rate vs VALL-E's 5.9%) and audio similarity
(0.681 vs VALL-E's 0.580), while being up to 20 times faster. See
voicebox.metademolab.com for a demo of the model.
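Flow matching, Voicebox's training objective, regresses a network onto a target velocity field along a probability path between noise and data. A toy sketch of constructing one training example is below, assuming the simple optimal-transport path with a small sigma_min; Voicebox's conditioning on text and audio context, and its exact path parameters, are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_pair(x1, sigma_min=1e-4):
    """One conditional flow-matching training example (OT path).

    x1: a data sample (e.g. one frame of speech features).
    Returns (t, x_t, u_t): the sampled time, the noisy interpolant fed
    to the network, and the target velocity the network regresses onto.
    """
    x0 = rng.standard_normal(x1.shape)           # noise sample
    t = rng.uniform()                             # time in [0, 1)
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1  # interpolant
    u_t = x1 - (1 - sigma_min) * x0               # target velocity
    return t, x_t, u_t
```

Training minimizes the squared error between the network's predicted velocity at (x_t, t, conditioning) and u_t; sampling then integrates the learned velocity field from noise to data.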
CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings
Following the success of the 1st, 2nd, 3rd, 4th, and 5th CHiME challenges, we
organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6).
The new challenge revisits the previous CHiME-5 challenge and further considers
the problem of distant multi-microphone conversational speech diarization and
recognition in everyday home environments. Speech material is the same as the
previous CHiME-5 recordings except for accurate array synchronization. The
material was elicited using a dinner party scenario with efforts taken to
capture data that is representative of natural conversational speech. This
paper provides a baseline description of the CHiME-6 challenge for both
segmented multispeaker speech recognition (Track 1) and unsegmented
multispeaker speech recognition (Track 2). Of note, Track 2 is the first
challenge activity in the community to tackle an unsegmented multispeaker
speech recognition scenario with a complete set of reproducible open source
baselines providing speech enhancement, speaker diarization, and speech
recognition modules.